Week 05: Semantic Analysis and Type Checking

Semantic Analysis

Error Detection

Lexical Analysis: Detects inputs with illegal tokens.
Parsing: Detects inputs with ill-formed parse trees.
Semantic Analysis: Catches all remaining errors.
Example. (Cool)
1. All identifiers are declared
2. Types
3. Inheritance relationships
4. Classes defined only once
5. ...

Scope

Matching identifier declarations with uses.
The scope of an identifier is the portion of a program in which that identifier is accessible.
The same identifier may refer to different things in different parts of the program.
Most languages have static scope.
A few languages are dynamically scoped.
- A dynamically-scoped variable refers to the closest enclosing binding in the execution of the program.
Cool identifier bindings are introduced by
- Class declarations (introduce class names)
- Method definitions (introduce method names)
- Let expressions (introduce object id’s)
- Formal parameters (introduce object id’s)
- Attribute definitions (introduce object id’s)
- Case expressions (introduce object id’s)
Not all identifiers follow the most-closely nested rule.
- Example. (class definitions in Cool)
  - Globally visible throughout the program.
Attribute names are global within the class in which they are defined.
Methods may also be redefined (overridden).

Symbol Table

Much of semantic analysis can be expressed as a recursive descent of an AST
- Before: Push the identifiers into symbol tables (stack).
- In the Process: Process the children of AST.
- After: Pop the identifier up from symbol tables (stack).
- Use a stack because there are identifier definitions with the same name.
Symbol Table Operations
- add_symbol(x)
- find_symbol(x): ask type in semantics / memory location in cgen
- remove_symbol(x)
- enter_scope()
- exit_scope()
- check_scope(x)
Because class can be used before being defined, how to manage class?
- Solution: do 2 tree traversals.
  - Pass 1: Gather all class names
  - Pass 2: Do the type checking
- Actually, semantic analysis requires multiple passes.

Type Checking

Types

Meaning: The notion varies in different languages.
Consensus
- A set of values
- A set of operations on those values
In assembly language level, there are no types. For add $r1, $r2, $r3, just add it bitwisely.
Same type, but not make sense.
- Function Pointer (int) + Number (int)
Type system in a language specifies which operations are valid for which types.
The goal of type checking is to ensure that operations are used only with the correct type.
Three kinds of languages
- Statically typed: All or almost all checking of types is done as part of compilation.
- Dynamically typed: Almost all checking of types is done as part of program execution.
- Untyped: No type checking.
Competing view
- Static Typing
  - Static checking catches many programming errors at compile time.
  - Avoids overhead of runtime type checks.
- Dynamic Typing
  - Static type systems are restrictive.
  - Rapid prototyping difficult within a static type system.
A lot of code is written in statically typed languages with an “escape” mechanism. Unsafe casts in C, Java.
People retrofit static typing to dynamically typed languages for optimization, debugging.
The compiler infers types for every expression.
Term.
- Type Checking is the process of verifying fully typed programs.
- Type Inference is the process of filling in missing type information.

Type Checking

We have seen two examples of formal notation specifying parts of a compiler.
- Lexical Analysis: Regular expressions
- Parsing: Context-free grammars
- Type checking: Logical rules of inference
Inference Rules
- Form: If hypothesis is true, then Conclusion is true.
- Type checking: If $E_1$ and $E_2$ have certain types, then $E_3$ has a certain type.
- Notation
  - And: $\wedge$
  - If-Then: $\Rightarrow$
  - $x$ has type $T$: $x:T$
  - It is provable that...: $\vdash$
Example.
- If $e_1$ has type Int and $e_2$ has type Int, then $e_1+e_2$ has type Int.
- ($e_1$ has type Int $\wedge$ $e_2$ has type Int) $\Rightarrow$ $e_1+e_2$ has type Int.
- $(e_1:Int \wedge e_2:Int) \Rightarrow e_1+e_2:Int$
- $Hypothesis_1 \wedge \dots \wedge Hypothesis_n \Rightarrow Conclusion$
Tradition inference rules are writen by $$\frac{\vdash Hypothesis \dots \vdash Hypothesis}{\vdash Conclusion}$$
Cool type rules: $\vdash e:T$
Example.
- [Int] $$\frac{i:Constant}{\vdash i:Int}$$
- [Add] $$\frac{\vdash e_1:Int \vdash e_2:Int}{\vdash e_1+e_2:Int}$$
- [1+2] $$\frac{\frac{1:Constant}{\vdash 1:Int}\frac{2:Constant}{\vdash 2:Int}}{\vdash 1+2:Int}$$
A type system is sound if
- When $\vdash e:T$, then $e$ evaluates to a value of type $T$.
- We only want sound rules and the more precise one. $(\vdash i:Int) > (\vdash i:Object)$
Types are computed in a bottom-up pass over the AST.

Type Environment

Other Rules
- [False] $\frac{}{\vdash false: Bool}$
- [String] $\frac{s:StringConstant}{\vdash s:String}$
- (Ignore SELF_TYPE) new T produce an object of type T
  - [New] $\frac{}{\vdash new T:T}$
- [Not] $\frac{\vdash e: Bool}{\vdash !e:Bool}$
- [Loop] $\frac{\frac{\vdash e_1:Bool}{\vdash e_2:T}}{\vdash while\ e_1 loop\ e_2 pool:Object}$
- [Var] $\frac{x\ is\ a\ variable}{\vdash x: ?}$
  - We need the type rule.
Type Environment
- A type environment gives types for free variables.
  - A type environment is a function from Object Identifiers to Types
  - A variable is free in a expression if it is not defined within the expression.
- Let $O$ be a function from Object Identifiers to Types.
- Writen by: $O\vdash e:T$
- Add to every rules.
  - [Int] $$\frac{i:Constant}{O\vdash i:Int}$$
  - [Add] $$\frac{O\vdash e_1:Int\ O\vdash e_2:Int}{O\vdash e_1+e_2:Int}$$
  - [Var] $$\frac{O(x)=T}{O\vdash x:T}$$
  - [Let-No-Init] $$\frac{O[T_0/x]\vdash e_1:T_1}{O\vdash let\ x:T_0\ in\ e_1:T_1}$$
  - Remark
    - $O[T/x](x) = T$
    - $O[T/x](y) = O(y)$
- Remark
  - The type environment gives types to the free identifiers in the current scope.
  - The type environment is passed down the AST from the root towards the leaves.
  - Types are computed up the AST from the leaves towards the root.

Subtyping

Consider let rule
- [Let-Init] $$\frac{O\vdash e_0:T_0\\O[T_0/x]\vdash e_1:T_1}{O\vdash let\ x:T_0\leftarrow e_0\ in\ e_1:T_1}$$
Define a relation $\leq$ on classes
- $X\leq X$
- $X\leq Y$ if $X$ inherits from $Y$
- $X\leq Z$ if $X\leq Y$ and $Y\leq Z$
  - [Let-Init] $$\frac{O\vdash e_0:T_0\\O[T_0/x]\vdash e_1:T_1\\T_0\leq T}{O\vdash let\ x:T\leftarrow e_0\ in\ e_1:T_1}$$
  - [Assign] $$\frac{O(x)=T_0\\O\vdash e_1:T_1\\T_1\leq T_0}{O\vdash x\leftarrow e_1:T_1}$$
Let $O_c(x)=T$ for all attributes $x:T$ in class $C$
- [Attr-Init] $$\frac{O_c(x)=T_0\\O_c\vdash e_1:T_1\\T_1\leq T_0}{O_c\vdash x:T_0\leftarrow e_1}$$
How to decide the return type of if statement?
- if $e_0$ then $e_1$ else $e_2$ fi
- The best way we can do is the smallest supertype larger than the type of $e_1$ or $e_2$.
- $lub(X,Y)$, the least upper bound of $X$ and $Y$, is $Z$ if
  - $X\leq Z \wedge Y \leq Z$, $Z$ is an upper bound.
  - $X\leq Z' \wedge Y \leq Z' \Rightarrow Z\leq Z'$, $Z$ is an upper bound.
- In COOL, the least upper bound of two types is their least common ancestor in the inheritance tree.
- [If-Then-Else] $$\frac{O\vdash e_0:Bool\\O\vdash e_1:T_1\\O\vdash e_2:T_2}{O\vdash if\ e_0\ then\ e_1\ else\ e_2\ fi:lub(T_1,T_2)}$$
- [Case] $$\frac{O\vdash e_0:T_0\\O[T_1/x_1]\vdash e_1:T_{1`}\\\dots\\ O[T_n/x_n]\vdash e_n:T_{n'}}{O\vdash case\ e_0\ of\ x_1:T_1\rightarrow e_1;\dots ;x_n:T_n\rightarrow e_n;esac:lub(T_1,T_2)}$$

Typing Methods

In COOL, method and object identifiers live in different namespaces.
- A method foo and an object foo can coexist in the same scope.
- We use a separate mapping $M$ For method signatures $$M(C,f)=(T_1,\dots,T_n,T_{n+1})$$ means in class $C$ there is a method $f$ $$f(x_1:T_1,\dots,x_n:T_n):T_{n+1}$$
- [Dispatch] $$\frac{O,M\vdash e_0:T_0\\ O,M\vdash e_1:T_1\\ \dots\\ O,M\vdash e_n:T_n\\ M(T_0, f)=(T_{1'},\dots,T_{n'},T_{n+1})\\ T_i\leq T_{i'},\ for\ 1\leq i\leq n}{O,M\vdash e_0.f(e_1,\dots,e_n):T_{n+1}}$$
- [Static Dispatch] $$\frac{O,M\vdash e_0:T_0\\ O,M\vdash e_1:T_1\\ \dots\\ O,M\vdash e_n:T_n\\ T_0\leq T\\ M(T, f)=(T_{1'},\dots,T_{n'},T_{n+1})\\ T_i\leq T_{i'},\ for\ 1\leq i\leq n}{O,M\vdash e_0@T.f(e_1,\dots,e_n):T_{n+1}}$$
The method environment must be added to all rules.
In most cases, $M$ is passed down but not actually used. (Only the dispatch rules use $M$)
- [Add] $$\frac{O,M\vdash e_1:Int\ O,M\vdash e_2:Int}{O,M\vdash e_1+e_2:Int}$$
To derive the type of SELF_TYPE, we need to know the class in which an expression appears.
The full type environment for COOL
- A mapping $O$ giving types to object id's
- A mapping $M$ giving types to methods
- The current class $C$
- [Add] $$\frac{O,M,C\vdash e_1:Int\ O,M,C\vdash e_2:Int}{O,M,C\vdash e_1+e_2:Int}$$
General Themes
- Type rules are defined on the structure of expressions.
- Types of variables are modeled by an environment.

Implementation of Typing Checking

COOL type checking can be implemented in a single traversal over the AST.
Type environment is passed down the tree from parent to child.

Types are passed up the tree from child to parent.

[Add] $$\frac{O,M,C\vdash e_1:Int\ O,M,C\vdash e_2:Int}{O,M,C\vdash e_1+e_2:Int}$$

TypeCheck(Environment, e1+e2) = {
  T1 = TypeCheck(Environment, e1);
  T2 = TypeCheck(Environment, e2);
  Check T1 == T2 == Int;
  return Int;
}

[Let-Init] $$\frac{O\vdash e_0:T_0\\O[T_0/x]\vdash e_1:T_1\\T_0\leq T}{O\vdash let\ x:T\leftarrow e_0\ in\ e_1:T_1}$$

TypeCheck(Environment, let x:T <- e0 in e1) = {
  T0 = TypeCheck(Environment, e0);
  T1 = TypeCheck(Environment.add(x:T), e1);
  Check subtype(T0,T);
  return T1;
}